The following data set contains metrics on posts from a well-known cosmetics brand's Facebook page. The goal is to identify the variables that most strongly influence the target variable, total interactions.
options(
digits = 2,
scipen = 999,
warn = -1
)
rm(
list = ls()
)
library(magrittr)
library(ggplot2)
Facebook <- readr::read_csv(
file = "/Users/thienpham/Data Mining/Data/Prepared_Facebook.csv",
col_types = "ncnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn"
) %>%
as.data.frame()
set.seed(1)
Facebook$partition <- sample(
x = c("train", "test"),
size = nrow(Facebook),
replace = TRUE,
prob = c(0.85,0.15)
)
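Because sample() with replace = TRUE draws each row's label independently, the realised split only approximates the 85/15 target. A self-contained check on synthetic data (the data frame here is a stand-in, not the real file, whose path is machine-specific):

```r
# Synthetic stand-in for the Facebook data frame (1,000 rows of noise)
set.seed(1)
df <- data.frame(x = rnorm(1000))

# Same probabilistic 85/15 split as the partition column above
df$partition <- sample(
  x = c("train", "test"),
  size = nrow(df),
  replace = TRUE,
  prob = c(0.85, 0.15)
)

# The realised training proportion fluctuates around 0.85
prop_train <- mean(df$partition == "train")
```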
numeric.pred <- c("page.like", "hour", "unique.views", "nonunique.views", "unique.clickers", "nonunique.clickers",
"total.clicks", "views.by.liker", "views.by.unique.likers", "clicks.by.liker", "comment", "like", "share")
v_cor <- sapply(
X = numeric.pred,
FUN = function(j) cor(
x = Facebook[,j],
y = Facebook$YJ.total.interactions,
method = "spearman"
)
)
v_cor <- v_cor[order(
x = abs(
x = v_cor
),
decreasing = TRUE
)]
Facebook$bin <- cut_number(
x = Facebook$total.interactions,
n = 3,
closed = "left"
)
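cut_number() (from ggplot2) splits a vector into n bins containing roughly equal numbers of observations; here it is used only to colour the scatter plots. A base-R equivalent on a small hypothetical vector, using sample quantiles as break points:

```r
# Equal-count binning: quantile breaks give roughly equal bin membership
x  <- c(5, 1, 9, 3, 7, 2, 8, 4, 6)
br <- quantile(x, probs = seq(0, 1, length.out = 4))  # breaks for 3 bins
b  <- cut(x, breaks = br, include.lowest = TRUE)
counts <- as.vector(table(b))  # three observations per bin
```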
for(j in names(v_cor)){
p <- ggplot(Facebook) +
aes(x = .data[[j]], y = YJ.total.interactions, color = bin, fill = bin) +
geom_point() +
geom_smooth(color = "black", method = "gam", formula = y ~ s(x, bs = "cs")) +
labs(
title = "Scatter plot of transformed total interaction by predictor variable",
subtitle = j,
caption = paste0("Spearman correlation: ",round(v_cor[j],3))
) +
xlab(j)
print(p)
}
Based on the plots and Spearman correlation coefficients, like has the strongest correlation with the target variable, followed by share and comment. This is expected, since total interactions is composed of these three counts. After those, clicks.by.liker (clicks from users who have liked the page) is strongest, with a correlation of 0.664. Among the numeric predictors, the most promising are like, share, comment, clicks.by.liker, nonunique.views, unique.views, and unique.clickers; beyond these, the Spearman correlations drop into the 0.5 range and below.
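Spearman correlation is used above because it measures monotone association on ranks, making it robust to the heavy skew these count variables typically show. A hypothetical sketch of how it differs from Pearson:

```r
set.seed(1)
x <- runif(200, 0, 10)
y <- exp(x) + rnorm(200)  # monotone in x, but strongly nonlinear

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
# Spearman stays near 1 for any monotone trend; Pearson is pulled down
# by the nonlinearity
```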
categorical.pred <- c("Paid", "TypePhoto", "TypeStatus", "TypeVideo" ,"month10", "month11","month12","month2",
"month3","month4","month5","month6","month7","month8","month9", "Category2","Category3", "day2",
"day3", "day4" ,"day5","day6","day7")
kruskal <- sort(sapply(
X = categorical.pred,
FUN = function(x) kruskal.test(
x = Facebook$YJ.total.interactions,
g = Facebook[,x]
)$p.value
))
for(j in names(kruskal)){
p <- ggplot(Facebook) +
aes(x = factor(.data[[j]]), y = YJ.total.interactions) +
geom_boxplot() +
labs(
title = "Box plot of transformed total interactions by predictor variable",
subtitle = j,
caption = paste0("P-value of Kruskal-Wallis test: ",signif(kruskal[j],1))
) +
xlab(j)
print(p)
}
From the box plots, the variable with the lowest p-value was month3, meaning posts made in March show the strongest evidence of a difference in total interactions. The next most influential variable was Category3 (inspirational posts). Among the categorical variables, the promising candidates are month3, Category3, day4, month9, Category2, Paid, TypeStatus, month5, and month10; beyond these, the p-values reach or exceed 0.05, the conventional significance cutoff.
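The Kruskal-Wallis test used above compares the distribution of a numeric response across the levels of a factor without assuming normality; a small p-value indicates at least one group differs. A minimal synthetic illustration:

```r
set.seed(1)
# Three hypothetical groups: group "c" is shifted up by one standard deviation
g <- factor(rep(c("a", "b", "c"), each = 50))
y <- c(rnorm(50, mean = 0), rnorm(50, mean = 0), rnorm(50, mean = 1))

kt <- kruskal.test(x = y, g = g)
p_value <- kt$p.value  # small, so we reject equal distributions
```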
M_cor <- cor(
x = Facebook[,numeric.pred],
method = "spearman",
use = "pairwise.complete.obs"
)
M_CramerV <- DescTools::PairApply(
x = Facebook[,names(kruskal)],
FUN = DescTools::CramerV
)
corrplot::corrplot(
corr = M_cor,
diag = FALSE,
is.corr = FALSE
)
corrplot::corrplot(
corr = M_CramerV,
diag = FALSE,
is.corr = FALSE,
order = "hclust",
hclust.method = "ward.D"
)
For the numeric variables, we see strong correlation among all of the view-related variables, and likewise among all of the click-related variables. For the categorical variables, we notice a strong association between TypeStatus and TypePhoto, which is expected: each post has exactly one type, so the type dummies are mutually exclusive.
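Cramér's V, which DescTools::PairApply computes pairwise above, is derived from the chi-squared statistic of the contingency table. A hand-rolled equivalent (without DescTools' optional bias correction) applied to two hypothetical, deliberately associated binary indicators:

```r
# V = sqrt(chi^2 / (n * (min(rows, cols) - 1))); for a 2x2 table V equals |phi|
cramer_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

set.seed(1)
a <- rbinom(500, 1, 0.5)
b <- ifelse(runif(500) < 0.8, a, 1L - a)  # b agrees with a 80% of the time
v <- cramer_v(a, b)  # around 0.6 in expectation (2 * 0.8 - 1)
```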
full <- YJ.total.interactions ~ page.like+hour+unique.views+nonunique.views+unique.clickers+
nonunique.clickers+total.clicks+views.by.liker+views.by.unique.likers+clicks.by.liker+comment+
like+share+Paid+TypePhoto+TypeStatus+TypeVideo+month10+month11+month12+month2+month3+month4+month5+month6+
month7+month8+month9+Category2+Category3+day2+day3+day4+day5+day6+day7
lm_full <- lm(
formula = full,
data = Facebook %>% dplyr::filter(
partition == "train"
)
)
summary(
object = lm_full
)
##
## Call:
## lm(formula = full, data = Facebook %>% dplyr::filter(partition ==
## "train"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.732 -0.079 -0.001 0.081 3.465
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.78280367 0.79777161 0.98 0.32709
## page.like 0.00000298 0.00000874 0.34 0.73367
## hour -0.00146623 0.00289198 -0.51 0.61245
## unique.views 0.00826812 0.07746886 0.11 0.91506
## nonunique.views 0.03932247 0.06808823 0.58 0.56393
## unique.clickers -0.42541339 0.19962729 -2.13 0.03372 *
## nonunique.clickers 0.35727347 0.17283673 2.07 0.03939 *
## total.clicks -0.01425450 0.04236576 -0.34 0.73671
## views.by.liker 0.15473266 0.11236399 1.38 0.16930
## views.by.unique.likers -0.28355952 0.13191968 -2.15 0.03222 *
## clicks.by.liker 0.20685913 0.10303814 2.01 0.04539 *
## comment 0.10157446 0.02123615 4.78 0.000002467169821 ***
## like 0.73065652 0.01552531 47.06 < 0.0000000000000002 ***
## share 0.12028154 0.01527099 7.88 0.000000000000035 ***
## Paid 0.02313242 0.02760493 0.84 0.40256
## TypePhoto -0.17961754 0.06665500 -2.69 0.00735 **
## TypeStatus -0.36486423 0.09360194 -3.90 0.00011 ***
## TypeVideo -0.29338924 0.12136735 -2.42 0.01610 *
## month10 -0.32489611 0.43679494 -0.74 0.45744
## month11 -0.16082666 0.44773321 -0.36 0.71964
## month12 -0.15395691 0.45735456 -0.34 0.73658
## month2 -0.13784074 0.09657727 -1.43 0.15432
## month3 -0.15663875 0.14140071 -1.11 0.26866
## month4 -0.14930975 0.21957134 -0.68 0.49691
## month5 -0.20494990 0.28078779 -0.73 0.46589
## month6 -0.19687840 0.34374146 -0.57 0.56715
## month7 -0.11756797 0.38083982 -0.31 0.75771
## month8 -0.27359331 0.40612307 -0.67 0.50093
## month9 -0.26827372 0.42798536 -0.63 0.53114
## Category2 0.10668183 0.03987065 2.68 0.00778 **
## Category3 0.04867124 0.03812519 1.28 0.20251
## day2 0.00942330 0.04516368 0.21 0.83483
## day3 0.05642003 0.04628725 1.22 0.22363
## day4 -0.02299033 0.04543163 -0.51 0.61312
## day5 -0.00191418 0.04646954 -0.04 0.96716
## day6 0.00164364 0.04405920 0.04 0.97026
## day7 -0.03311211 0.04338330 -0.76 0.44579
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.23 on 383 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.99
## F-statistic: 1.14e+03 on 36 and 383 DF, p-value: <0.0000000000000002
lm_main <- lm(
formula = YJ.total.interactions ~ unique.clickers + views.by.unique.likers + nonunique.clickers +
clicks.by.liker + comment + like + share + TypePhoto +
TypeStatus + TypeVideo + Category2,
data = Facebook %>% dplyr::filter(
partition == "train"
)
)
summary(
object = lm_main
)
##
## Call:
## lm(formula = YJ.total.interactions ~ unique.clickers + views.by.unique.likers +
## nonunique.clickers + clicks.by.liker + comment + like + share +
## TypePhoto + TypeStatus + TypeVideo + Category2,
## data = Facebook %>% dplyr::filter(partition ==
## "train"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.756 -0.069 0.003 0.067 3.765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0366 0.1701 6.09 0.0000000026 ***
## unique.clickers -0.1984 0.1622 -1.22 0.2220
## views.by.unique.likers -0.0486 0.0232 -2.09 0.0369 *
## nonunique.clickers 0.2304 0.1341 1.72 0.0865 .
## clicks.by.liker 0.0388 0.0627 0.62 0.5364
## comment 0.1052 0.0202 5.21 0.0000003073 ***
## like 0.7293 0.0147 49.59 < 0.0000000000000002 ***
## share 0.1303 0.0133 9.80 < 0.0000000000000002 ***
## TypePhoto -0.1480 0.0641 -2.31 0.0215 *
## TypeStatus -0.2525 0.0879 -2.87 0.0043 **
## TypeVideo -0.1669 0.1151 -1.45 0.1477
## Category2 0.0779 0.0306 2.55 0.0112 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.24 on 408 degrees of freedom
## Multiple R-squared: 0.99, Adjusted R-squared: 0.989
## F-statistic: 3.57e+03 on 11 and 408 DF, p-value: <0.0000000000000002
M <- Facebook %>%
dplyr::mutate(
predict_full = predict(object = lm_full,newdata = Facebook),
predict_main = predict(object = lm_main,newdata = Facebook),
residual_full = predict_full - YJ.total.interactions,
residual_main = predict_main - YJ.total.interactions
)
M %>%
dplyr::select(
partition,YJ.total.interactions,predict_full,predict_main
) %>%
tidyr::gather(
key = "model",value = "predict",-partition,-YJ.total.interactions
) %>%
dplyr::select(partition,model,YJ.total.interactions,predict) %>%
dplyr::group_by(partition,model) %>%
dplyr::summarise(r = cor(YJ.total.interactions,predict),.groups = "drop") %>%
tidyr::spread(key = partition,value = r)
## # A tibble: 2 × 3
## model test train
## <chr> <dbl> <dbl>
## 1 predict_full 0.994 0.995
## 2 predict_main 0.995 0.995
After keeping only the variables significant at alpha = 0.05, the full model and the main-effects model produce essentially the same training and test correlations. The main-effects model is preferable: it matches the full model's performance with far fewer variables, which reduces the risk of overfitting, lowers computational cost, and improves interpretability.
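The pattern seen here, where dropping non-significant terms barely moves the fit, can be reproduced on synthetic data: one genuine predictor plus ten pure-noise columns. All names below are hypothetical:

```r
set.seed(1)
n <- 200
d <- data.frame(x = rnorm(n), matrix(rnorm(n * 10), nrow = n))
names(d)[2:11] <- paste0("noise", 1:10)
d$y <- 2 * d$x + rnorm(n)

lm_full_toy <- lm(y ~ ., data = d)  # genuine predictor plus 10 noise terms
lm_main_toy <- lm(y ~ x, data = d)  # genuine predictor only

r2_full <- summary(lm_full_toy)$r.squared
r2_main <- summary(lm_main_toy)$r.squared
# The noise terms buy almost nothing: the two R-squared values nearly coincide
```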
pred <- c("page.like", "hour", "unique.views", "nonunique.views", "unique.clickers",
"nonunique.clickers", "total.clicks", "views.by.liker", "views.by.unique.likers", "clicks.by.liker", "comment",
"like", "share", "Paid", "TypePhoto", "TypeStatus", "TypeVideo", "month10", "month11", "month12", "month2", "month3", "month4", "month5", "month6",
"month7", "month8", "month9", "Category2", "Category3", "day2", "day3", "day4", "day5", "day6", "day7"
)
caret::rfe(
x = Facebook[Facebook$partition == "train",pred],
y = Facebook$YJ.total.interactions[Facebook$partition == "train"],
rfeControl = caret::rfeControl(
functions = caret::lmFuncs
)
)
##
## Recursive feature selection
##
## Outer resampling method: Bootstrapped (25 reps)
##
## Resampling performance over subset size:
##
## Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
## 4 0.454 0.933 0.312 0.3852 0.12340 0.3106
## 8 0.269 0.986 0.167 0.0691 0.00673 0.0216
## 16 0.262 0.987 0.155 0.0681 0.00632 0.0322
## 36 0.245 0.989 0.140 0.0774 0.00681 0.0260 *
##
## The top 5 variables (out of 36):
## unique.clickers, like, nonunique.clickers, views.by.unique.likers, month10
Recursive feature elimination gives a useful ranking of variable importance, but on its own it is somewhat limited. Combined with the correlation, Kruskal-Wallis, and regression results above, it rounds out the picture of which variables matter most.